Abstract- Converting a word to its original form is called stemming, which is extremely important in the field of Natural Language Processing (NLP). It is an integral part of the linguistic pre-processing of every NLP application. Stemming converts inflectional word forms into their root word. Much work has been done on stemming in different national and regional languages such as English, French, Arabic, German, Urdu, and Hindi. Many regional languages still need work to build digital resources using NLP. Saraiki is one of the widely spoken regional languages in Pakistan. Almost eighty million people use this language for communication. Very limited digital resources are available in the Saraiki language to support advancement in NLP technologies. The current research proposes a hybrid stemmer to stem Saraiki words. The hybrid stemmer contains two hundred prefix and postfix rules and a Long Short-Term Memory (LSTM) based sequence-to-sequence model for converting Saraiki words into their stems. Firstly, the Saraiki text was pre-processed, and a rule set was implemented. Secondly, the LSTM-based sequence-to-sequence model was deployed to stem the Saraiki words. In the last step, the performance of the Saraiki stemmer was evaluated by measuring the stemming accuracy of the rule set and of the LSTM-based sequence-to-sequence model. In the experiments, the rule set stemmed words with 68.53% accuracy, while the LSTM-based sequence-to-sequence model produced 93.0% accuracy. This work contributes significantly to the regional linguistic field by introducing a stemmer for the Saraiki language.


Index Terms- Hybrid Stemmer, LSTM, Rule-based Stemmer, Saraiki, Stemming


I. Introduction 


The current research is organized into sections. Section I introduces the Saraiki language, its history, origin, and varieties. It includes the role of Natural Language Processing (NLP) research for low-resource languages such as the Saraiki language. This section also covers the Saraiki alphabet, facts about stemming, and its types.


A. Overview of Saraiki Language 


Saraiki is an Indo-Aryan language widely used in Pakistan and India [1]. Saraiki is the language of eighty million people in Pakistan, spread across all four provinces, with a majority of speakers in southern Punjab [2]. The name Saraiki originated from the word "Sauvira," an ancient kingdom also mentioned in the Sanskrit epic Mahabharata [3].


The Saraiki language was considered a dialect of Punjabi by most British colonial administrators [4]. There are several varieties of the Saraiki language, such as Central Saraiki, which is the best known and is spoken in Multan, Muzaffargarh, Bahawalpur, and Dera Ghazi Khan, Pakistan [5]. The other varieties are Southern Saraiki, Sindhi Saraiki, Kulachlwall Saraiki, Northern Saraiki, and Eastern Saraiki. All these varieties have minor regional variations in punctuation [6], [7].


Natural Language Processing (NLP) employs computational techniques to learn, understand, and produce human language content [8]. Early computational approaches to language research focused on automating the analysis of the linguistic structure of language and developing basic technologies such as machine translation, speech recognition, and speech synthesis [9], [10]. Today's researchers refine and use such tools in real-world applications, creating spoken dialogue systems and speech-to-speech translation engines, mining social media for information about health or finance, and identifying sentiment and emotion toward products and services [11].


The major challenge for researchers in the field of NLP is low-resource languages such as the Saraiki language. Some NLP research may focus on creating novel language resources and benchmarks; some may customize existing NLP solutions to new languages and domains [12]. In parallel, there can be research that actively explores new NLP techniques that could generalize to different low-resource setups, in terms of both data availability and the availability of computational resources [13]. The prime purpose of this paper is to develop a Saraiki text stemmer that would eventually open a door for further analysis of the texts written in the Saraiki language.

School of System and Technology, UMT


Volume 2 Issue 2, Fall 2022 


Saraiki Language Hybrid Stemmer... 


B. Saraiki Language Alphabets 


There are around eighty million native speakers of Saraiki in Pakistan and India. It is written in the Perso-Arabic script; however, it has its own alphabet, which consists of 45 letters. Of these forty-five letters, thirty-nine are the same as in the Urdu language, and six are additional letters. Some researchers consider it a dialect of the standard Punjabi language, whereas it is a separate language with its own identity [14], [15]. Table I shows the alphabet of the Saraiki language and the corresponding sounds in English.


Table I 
Saraiki Language Alphabets and Sounds 
Saraiki English Saraiki English Saraiki English 
Alphabet Sound Alphabet Sound Alphabet Sound 
| a 5 T < 8 
= b E Z J l 
= p 3 zh ê m 
a t LN s Q n 
ral t us sh 3 Vv, 0, ort 
a th ua 8 20.5 h 
Z j ua Z A h 
id ch i t c i 
‘a h L Z E y, i 
Z kh € ' a ai ore 
à d a gh 
3 d a) f 
à dh 3 q 
J r cS k 


C. Stemming and Types 


Stemming is the linguistic process of converting inflected forms of words to their basic form [16]-[18]. Stemming is a challenging task for the many languages with ambiguous structure and for resource-poor languages [19], [20]. Stemming has been seen from various perspectives by academics working in information retrieval (IR) [21]. Stemming can improve the performance of text processing because it merges words that have the same root. Words with the same root are considered to have a similar meaning. Therefore, documents containing words with the same root are considered relevant to one another, and the stemming process reduces the feature dimension of the documents [22]. It is a linguistic procedure in which the numerous morphological forms of a word are mapped to a single word form, known as the infinitive form [23]. A stemmer, for example, maps the words "maintained," "maintaining," and "maintenance" to the root word "maintain." Similarly, in the Saraiki
language: Words like 4s -2 

eas O45 can be reduced to the 
root word $<. 
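The effect of conflation on the feature dimension can be illustrated with a minimal sketch. The lookup table below is a hypothetical stand-in for a real stemmer and only covers the English example given above; it is not the method proposed in this paper.

```python
from collections import Counter

# Hypothetical lookup from inflected forms to a shared stem (illustration only).
STEM_LOOKUP = {
    "maintained": "maintain",
    "maintaining": "maintain",
    "maintenance": "maintain",
}


def stem(word):
    """Return the stem of a known inflected form, else the word unchanged."""
    word = word.lower()
    return STEM_LOOKUP.get(word, word)


def term_counts(tokens, use_stemming):
    """Count index terms, optionally conflating inflected forms first."""
    return Counter(stem(t) if use_stemming else t.lower() for t in tokens)


tokens = ["maintained", "maintaining", "maintenance", "maintain", "system"]
raw = term_counts(tokens, use_stemming=False)
stemmed = term_counts(tokens, use_stemming=True)

# Five distinct terms before stemming, two after.
print(len(raw), len(stemmed))
```

With stemming enabled, the four forms of "maintain" collapse into one index term, so a document representation built on these counts needs fewer features.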


Stemming algorithms have been developed during the last few decades to automate the process, which requires different approaches as it is mostly language-specific. Stemming algorithms are classified into statistical stemmers and rule-based stemmers. Statistical stemmers, also known as unsupervised stemmers, use training data to perform stemming. Rule-based stemmers use a set of defined rules to perform stemming [24], [26].


Statistical stemmers are inaccurate and fail to take advantage of some language phenomena that simple rules can easily express. On the other hand, handcrafting the stemming rules in rule-based stemmers is a time-consuming, tedious, and impractical task [27]. In this paper, a hybrid stemmer is proposed to develop a linguistic resource for the resource-poor Saraiki language.


In the next section, different 
researchers’ contributions to the 
development of stemming 
algorithms are presented in detail. 


II. Literature Review 


Stemmers play an important role 
in improving the performance and 
efficiency of any IR system. 
Researchers have contributed a lot 
to empower this area of linguistics 
for a better understanding of 
different languages by using 
modern tools and techniques. The 
current section describes the 
contribution of researchers in 
stemming algorithms proposed for 
various national and international 
languages such as English, Arabic, 
Persian, Hindi, Urdu, Sindhi, 
Hindko, and many others. The 
review of literature is divided into 
three sub-sections: Rule-based 
stemming, statistical stemming, and 
hybrid stemming. 


A. Rule-based Stemmers 


A Marathi language stemmer was proposed in 2022. Supervised ML techniques were adopted. Handwritten grammar rules and a word dictionary of the Marathi language were developed. The proposed system produced 61.36% accuracy [28]. A Telugu language stemmer was proposed in 2022. Rules were then developed to analyze Telugu text [29]. A Sindhi language stemmer was proposed in 2021, in which lexicon-based affix removal was deployed through stemming algorithms. Thirty-eight linguistic rules and a lexicon of fifty thousand words were adopted for the stemming process. The stemmer produced an accuracy of 84.85% [30]. A Punjabi language rule-based stemmer was proposed in 2020. Brute-force and suffix-stripping techniques were deployed. A lexicon containing 1762 root words was taken as input. The stemmer stemmed 1564 words properly and produced 88.75% accuracy [31]. A Bengali language rule-based stemmer was proposed in 2020. WordNet was used to produce better outcomes. A Bengali corpus containing 500 sentences was adopted. The stemmer produced 98.86% accuracy for nouns and 99.75% for verbs [32].


A Sinhalese language rule-based stemmer employing suffix and prefix rules was proposed in 2020. Stemming effectiveness was evaluated using Naïve Bayes, Random Forest, and Support Vector Machine classifiers. Two datasets containing words related to cricket, rugby, football, athletics, and academics were used. The stemmer produced promising results [31], [33]. An Assamese language rule-based stemmer was proposed in 2019 using the suffix stripping method, and WordNet was adopted for stemming. Rules were developed for nouns and verbs. The stemmer achieved 85% accuracy [34]. A Sundanese language rule-based stemmer was proposed in 2018. A dataset of 4,453 unique Sundanese affixed words was collected from 70 Balebat articles and 68 Dewan Dakwah Jabar articles for the experimental study. The stem and affix sequences were manually identified in the dataset. The results showed that the stemmer exceeds the modified baseline in terms of correctly stemmed affixed words and affix type identification accuracy. The stemmer correctly identified 68.87% of Sundanese affix types and correctly stemmed 96.79% of affixed words [35].


An Urdu language rule-based 
stemmer was proposed in 2017. Six 
infix word classes were created. 
Four corpora were used to create a 
directory of words. These corpora 
contain words related to Urdu news 
headlines, politics, weather, sports, 
Urdu dictionaries, and grammar 
books. The stemmer achieved 
87.4% accuracy [36]. A Ge'ez 
Language stemmer was introduced 
in 2017. An affix removal approach 


Innovative Computing Review 



Malik et al. 


was deployed. The stemmer was evaluated in terms of over-stemming, under-stemming, and structured laws. Manual error counting was used for evaluation. The stemmer achieved 75.95% accuracy [36]. A Gujarati language rule-based stemmer was proposed in 2014. Rules were defined for Gujarati text, and the EMILLE corpus was used to evaluate the stemmer. The stemmer achieved 92.41% accuracy [37]. A Nepali language rule-based stemmer was proposed in 2014. Affix stripping techniques were adopted for stem extraction from Nepali text. A corpus of 1800 words was used for the evaluation of the stemmer. The stemmer produced 90.48% accuracy [38], [39].


B. Statistical Stemmers 


The employment of statistical tools is another prevalent option. In such techniques, the system learns from an existing corpus and makes decisions about unknown material based on knowledge gained through experience; for instance, statistical measurements learned through experience are used to eliminate prefixes and other affixes. Statistical approaches are widely used in the language processing community. Examples include the N-gram-based statistics used by W. B. Frakes [40], the HMMs used by Melucci [41], and YASS: Yet Another Suffix Stripper by Majumder [42]. Such techniques are restricted in terms of gaining experience. In other words, such techniques seldom cover the language's whole grammar [43]. The N-gram technique, for example, is good only at removing suffixes. Still, the Urdu language contains suffixes, co-suffixes, prefixes, infixes, and circumfixes (prefixes and suffixes simultaneously) [44], and the use of HMM requires that every word start with a prefix [41]. This review demonstrates that statistical techniques for stemming are insufficient. A complete study of stemming techniques used in Urdu, Persian, and Arabic is given in [45].


A Gujarati language statistical stemmer was proposed in 2021. A supervised machine learning algorithm was adopted to assess the accuracy of web page classification. The stemmer reduces the inflectional or derivational forms of words to their root stem. The stemmer produced 97% accuracy [46]. A Marathi language statistical stemmer, namely MT stemmer, was proposed in 2021. The stemmer focused on removing suffixes to retrieve the root word form. Gender-based suffixes and auxiliary verb-based suffixes were removed through two-stage stemming. The average precision, recall, and F-1 score were 0.706, 0.806, and 0.753, respectively [47]. An Arabic language statistical stemmer was introduced in 2021. The major focus was the infix patterns of Arabic word lists. TREC2002, a standard dataset, was used for text retrieval. The performance of the stemmer was evaluated using three stemmers: Condlight, Light10, and ARLS. The proposed Arabic stemmer produced promising results with precision, recall, F-measure, and ICF at 60%, 79%, 68%, and 81%, respectively [48].
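The F-measure values quoted above follow from the reported precision and recall. As a minimal sketch, assuming the cited works use the standard F1 definition (the harmonic mean of precision and recall), the relationship can be checked as follows; the function name is illustrative only.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the review above.
marathi_f1 = f_measure(0.706, 0.806)
arabic_f1 = f_measure(0.60, 0.79)

print(round(marathi_f1, 3))  # 0.753
print(round(arabic_f1, 2))   # 0.68
```

Both results agree with the quoted scores after rounding. When precision and recall are averaged over several runs, the F1 of the averages need not equal the average F1, so exact agreement is not guaranteed in general.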


C. Hybrid Stemmers 


A hybrid stemmer combines a rule-based stemmer and a statistical stemmer [49]. A hybrid Urdu language stemmer was proposed in 2020. Word and text corpora were used as dictionaries. Affix-stemming rules were deployed. The stemmer produced promising results [31], [33]. A Gujarati hybrid stemmer was proposed in 2017 for removing affixes from Gujarati words to obtain optimal results. The Gujarati stemmer searches an online Gujarati dictionary for the stem word. The EMILLE corpus was used for stemming. The stemmer produced 97.09% accuracy [50]. A Punjabi language hybrid stemmer was proposed in 2017. A dataset containing 2.5 million tokens was used. A rule set containing sixty-three affix rules was created. The stemmer achieved 86.01% accuracy [51].


A hybrid stemmer for the Persian language was introduced in 2015. The affix removal method was adopted. Intervening and Makassar wordlists of Persian words were taken. The stemmer produced 97% accuracy [52]. An Arabic language hybrid stemmer was proposed in 2013. An Arabic language dictionary containing nine thousand words was used. Prefix and suffix rules were deployed. The stemmer produced promising results [53]. A Gujarati hybrid stemmer was proposed in 2011. A suffix stripping technique was adopted. Part of the module was also used. The stemmer achieved promising results [54].


Hence, it is concluded that there exist many resource-poor languages lacking linguistic research. Saraiki is one such language, in which limited research has been conducted. So, there is a dire need to develop resources for the Saraiki language to conduct proper research on texts written in the Saraiki language. Therefore, a hybrid Saraiki language stemmer is proposed in the current research. The next section presents the proposed methodology for the Saraiki language hybrid stemmer.


III. Materials and Methods


This section contains the 
proposed methodology for a hybrid 
Saraiki language stemmer. This 
section includes the proposed 
algorithm, corpus creation, rules 
creation, and sequence-to-sequence 
model. 


The model adopted for this work is shown in Figure 1. Firstly, preprocessing techniques such as tokenization, space removal, punctuation removal, and digit removal are applied to the Saraiki corpus. Secondly, the preprocessed data is processed through the rule-based stemming module and the LSTM sequence-to-sequence model to generate stem words. Finally, the results of the two techniques are compared.


[Figure 1 diagram: Saraiki Corpus; Text Preprocessing (Tokenization, Remove Spaces, Remove Punctuations, Remove Digits); Rule Based Stemming Module with Rule Generator (Prefix, Postfix); LSTM Sequence to Sequence Model; Saraiki Stem Word]


Fig. 1. Proposed model for Saraiki language hybrid stemmer


A. Saraiki Language Corpus 


A large amount of text was required to build the Saraiki language corpus. It was not easy to collect Saraiki language text, as there is no publicly available dataset of the Saraiki language for stemming purposes. The dataset was manually annotated by creating prefix and postfix word lists. The dataset contains 10000 Saraiki words taken from a Saraiki dictionary and documents available on different websites. Table II shows words from the Saraiki language dataset.


Table II
Saraiki Language Dataset
Saraiki Words Prefix Postfix
Oe Xl Oo 
UY jie Jj J) 
as A31 Ç 
CaS 2 Ñ 
Bags ae E 
US 5 es uk 
laa 5a a} lo 
ÖS us Gs 
ee AJ 
3S oS 3 
2s 5 P= 
É oS 3 
gS o3 N 
zak ar £a 
Luha ha Ll 
cl oh ol 
cih ük G 
JagSus os lo 
tsi 5) L 
Lh da L 
alal Js! 8" 
OLS us ol 
clee eS G 
Jul os ee 
ai ql i) 


The collected text was pre-processed before the creation of the Saraiki word lists. Word tokens were created, and punctuation, special characters, extra spaces, and digits were removed. Finally, the Saraiki language corpus was developed for performing the stemming process.
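The pre-processing steps described above can be sketched as follows. This is a minimal illustration that assumes Unicode character categories are an adequate way to detect punctuation, symbols, and digits in Perso-Arabic text; the function name and the sample string are hypothetical and are not taken from the Saraiki corpus.

```python
import unicodedata


def preprocess(text):
    """Tokenize text, dropping punctuation, symbols, digits, and extra spaces."""
    cleaned = []
    for ch in text:
        category = unicodedata.category(ch)
        # P* = punctuation, S* = symbols, N* = digits and other numbers.
        if category[0] in ("P", "S", "N"):
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    # str.split() without arguments also collapses runs of whitespace.
    return "".join(cleaned).split()


# Placeholder sample: two words, an Arabic comma, digits, and extra spaces.
sample = "سرائیکی  زبان، 2022 !"
print(preprocess(sample))
```

Replacing unwanted characters with a space, rather than deleting them, keeps two words separated when they are joined only by a punctuation mark.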


B. Rule-based Stemming Module 


The Saraiki language hybrid stemmer is based on two modules. The first is the rule-based stemmer module, which applies prefix and postfix rules to the Saraiki corpus to stem Saraiki words. Two hundred fifty postfix and prefix rules were created. A minimum word length rule was also created. After a word has been pre-processed, the prefix and postfix rules are applied. Once the applicable rules are determined, the word is broken down, and a list of possible stems is generated. If no applicable rule can be determined, the word itself is returned as the root word. To discover frequencies, the list of possibilities is compared with the dataset. The word with the highest frequency is returned as the root or stem word, because the most frequently occurring word has the highest probability of being the root or stem word. The following are some of the postfix and prefix rules.



C. Minimum Word Length Rule 


After a detailed analysis of Saraiki morphology, it was observed that a Saraiki word comprising only two or three characters is already a stem word. For example, the Saraiki words meaning "You" and "Elder" are already stem words. Such words are treated as stem words and filtered out to avoid further stemming. The identification of this rule is a novel contribution of the proposed Saraiki stemmer. Some example words covered by this rule are given in Table III.


Table III
Examples of Words Handled by the Minimum Word Length Rule


os us a oD 
oS gh hy ola 
Us Ja isla 4h 
es Ql ay Gs 


D. Postfix and Prefix Rules 


The following are some of the postfix and prefix rules for stemming Saraiki words. Two hundred and fifty rules are deployed in the hybrid Saraiki language stemmer.


Rule no 1: If the token has ending 
characters ù then remove ò from 
the end. For example oN to 4%. 


Rule no 2: If the token has ending 
characters ol, then remove o! from 
the end. For example, JY j to Jj. 


Rule no 3: If the token has ending 
characters 5, remove © from the end. 
For example Ga%S to ass, 


Rule no 4: If the token has ending 
characters ,4 then remove 3 from the 
end. For example Use to ce. 


Rule no 5: If the token has ending 
characters uz: then remove vu: from 
the end. For instance, ussis to Bs. 


Rule no 6: If the token has ending 
characters 12 then remove !3 from the 
end. For example, 125 5; to 452. 


Rule no 7: If the token has ending 
characters s, then remove 6» from 
the end. For example sks to KS, 


Rule no 8: If the token has ending 
characters . then remove .~ from 
the end. For example «Xi to 45. 


Rule no 9: If the token has ending 
characters s, remove s from the end. 
For instance, s£ to oS. 


Rule no 10: If the token has ending 
characters + then remove 2 4 
from the end. For example, 2 4S to 


5 


Rule no 11: If the token has ending 
characters ,’3) then remove 3! from 
the end. For example, 33 to o3. 


Rule no 12: If the token has ending 
characters <4 then remove << from 
the end. For example, 2h to dh, 


Rule no 13: If the token has ending 
characters «~, remove o from the 
end. For example “7 to Gin, 


Rule no 14: If the token has ending 
characters b! then remove +! from 
the end. For example LRA to ha, 


Rule no 15: If the token has ending 
characters «a then remove -. from 
the end. For example J XL to ob. 


Rule no 16: If the token has ending 
characters 2 then remove 2 from the 
end. For example LS! to ¢S. 


Rule no 17: If the token has ending 
characters 4 then remove 4 from the 
end. For example a< to dh. 


Rule no 18: If the token has ending 
characters Ub then remove ol from 
the end. For example, seh Ë to oL Ë. 


Rule no 19: If the token has ending 
characters s, remove ~_ from the 
end. For example 44% to =i, 


Rule no 20: If the token has ending 
characters s% then remove o> from 
the end. For example 4S! to 25). 


Rule no 21: If a word starts with bay+daal, then remove bay+daal from the beginning. For example, (badsirat) to (sirat).


Rule no 22: If a word starts with bay+badi-ye, then remove bay+badi-ye from the beginning. For example, (békdar) to (kadar).
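The rule-based procedure described in Sections B to D can be sketched as below. This is a simplified illustration, not the authors' implementation: the affix lists are hypothetical Latin-letter placeholders rather than the actual Saraiki rules, the tiny corpus is invented, and only "badsirat" to "sirat" mirrors an example from the rules above.

```python
from collections import Counter

MIN_STEM_LENGTH = 3  # words of two or three characters are treated as stems

# Hypothetical placeholder affixes; the real rule set holds Saraiki affixes.
PREFIX_RULES = ["be", "bad"]
POSTFIX_RULES = ["an", "en", "wala"]


def candidate_stems(word):
    """Apply every matching prefix or postfix rule and collect the results."""
    candidates = []
    for prefix in PREFIX_RULES:
        if word.startswith(prefix) and len(word) > len(prefix):
            candidates.append(word[len(prefix):])
    for postfix in POSTFIX_RULES:
        if word.endswith(postfix) and len(word) > len(postfix):
            candidates.append(word[:-len(postfix)])
    return candidates


def rule_based_stem(word, frequencies):
    """Return the most frequent candidate stem, or the word itself."""
    if len(word) <= MIN_STEM_LENGTH:
        return word  # minimum word length rule
    candidates = candidate_stems(word)
    if not candidates:
        return word  # no applicable rule: the word is returned unchanged
    # Ties (including all-zero frequencies) resolve to the first candidate.
    return max(candidates, key=lambda c: frequencies.get(c, 0))


# Invented romanized corpus used only to obtain word frequencies.
corpus = ["kitab", "kitaban", "kitab", "badsirat", "sirat", "tu"]
freq = Counter(corpus)
results = [rule_based_stem(w, freq) for w in ["kitaban", "badsirat", "tu", "qalam"]]
print(results)
```

Selecting the candidate by corpus frequency is what lets several competing rules fire on the same word without a fixed rule priority.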


E. LSTM-based Sequence to 
Sequence Model 


Sequence-to-sequence modeling is an efficient approach for automatically converting a word from a source sequence to a target sequence, and it is one of the intriguing applications of NLP. Long Short-Term Memory (LSTM) is a special kind of Recurrent Neural Network (RNN) [55]. LSTM networks are suitable for analyzing sequences of text data and predicting the next word. LSTM can also be a good solution for predicting the next point of a given time sequence [56].


In this work, an LSTM-based sequence-to-sequence model is deployed for stemming Saraiki words. A language translation example is shown in Figure 2.


F. Character-based Sequence to Sequence LSTM


Previous research has often used a pre-processing phase to extract words from source sequences and build a large vocabulary that supports the one-hot transformation of an input sequence into fixed-length vectors. However, the chosen vocabulary cannot include all of the terms in the data set because of the sparsity and high dimension of this word-level representation. As a result, special tokens (UNK) are frequently used to substitute out-of-vocabulary (OOV) terms, and a post-processing step is necessary to manage the UNKs in the output sequence to obtain the final translation results. In this research, we describe the LSTM encoder and decoder, a technique for character-level sequence encoding using RNN, as given in Figure 3.


[Figure 2 diagram: an LSTM encoder reads "The weather is nice" and passes its internal LSTM states (h, c) to an LSTM decoder, which receives "[START]Il fait beau" and outputs "Il fait beau[STOP]"]


Fig. 2. LSTM sequence to sequence model [57]
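Character-level encoding avoids the out-of-vocabulary problem because the vocabulary is the set of characters rather than the set of words. The sketch below shows one way to prepare one-hot encoder and decoder inputs for word-to-stem pairs. The tab and newline markers, the helper names, and the romanized word pairs are assumptions made for illustration; they are not taken from the paper.

```python
START, STOP = "\t", "\n"  # assumed start and stop marker characters


def build_vocab(words):
    """Map every character seen in the data to an integer index."""
    chars = sorted({ch for w in words for ch in w})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot(word, vocab, max_len):
    """Encode a word as max_len one-hot rows; unused rows stay all-zero."""
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for t, ch in enumerate(word):
        matrix[t][vocab[ch]] = 1
    return matrix


# Hypothetical romanized (word, stem) pairs.
pairs = [("kitaban", "kitab"), ("badsirat", "sirat")]
inputs = [w for w, _ in pairs]
targets = [START + s + STOP for _, s in pairs]

in_vocab = build_vocab(inputs)
out_vocab = build_vocab(targets)
max_in = max(len(w) for w in inputs)
max_out = max(len(t) for t in targets)

encoder_input = [one_hot(w, in_vocab, max_in) for w in inputs]
decoder_input = [one_hot(t, out_vocab, max_out) for t in targets]
print(len(in_vocab), max_in, len(out_vocab), max_out)
```

The resulting nested lists have the shape (samples, time steps, characters), which is the layout an encoder-decoder LSTM typically expects once converted to arrays.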


[Figure 3 diagram: character-level LSTM encoder and decoder with a softmax output layer]


Fig. 3. LSTM Saraiki stemmer


IV. Results and Discussion


This section discusses the 
performed experiments using a rule- 
based stemmer module and LSTM- 
based sequence to sequence Model. 
The Saraiki language dataset was 
taken for experiments, and results 
were discussed by analyzing the 
performance of the hybrid Saraiki 
stemmer. 


A. Experiment 1 - Rule-based Stemmer Module


In this experiment, prefix and postfix rules were applied to the Saraiki language dataset. Firstly, the rules were extracted and then applied to the given words. Ten thousand words from the Saraiki language dataset were taken, and two hundred fifty prefix-postfix rules were applied to these words. It was observed that the rule-based stemmer module correctly stemmed six thousand eight hundred and thirty-five words, while the other words were wrongly stemmed. The overall accuracy of the rule-based stemmer was observed to be 68.53%.
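Stemming accuracy of this kind is the share of words whose predicted stem matches the reference stem. A minimal sketch of the calculation is given below; the function name and the four romanized sample words are hypothetical and do not come from the Saraiki dataset.

```python
def stemming_accuracy(predicted, gold):
    """Percentage of words whose predicted stem equals the reference stem."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Hypothetical sample: three of four predictions match the reference stems.
predicted = ["kitab", "sirat", "qala", "tu"]
gold = ["kitab", "sirat", "qalam", "tu"]
print(stemming_accuracy(predicted, gold))  # 75.0
```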


The graphical representation of correctly and wrongly stemmed words is shown in Figure 4.


[Figure 4 chart: counts of correctly and wrongly stemmed words; the axis ranges up to 10000]


Fig. 4. Rule-based stemmer - correctly and wrongly stemmed words


Figure 5 shows the actual Saraiki word, the stem predicted by the rule-based stemmer, and the real stem word. The experimental results are also shown in Figure 5 as true or false matches.


Similarly, Figure 6 shows Saraiki stem words wrongly predicted by the rule-based stemmer module. Finally, after deploying the rule-based stemmer module on the Saraiki dataset of 10000 words, the stemmer achieved 68.53% accuracy.


[Figure 5: table with columns Word, PredictedStem, RealStem, and Match, listing 15 sample words (rows 0 to 14); 13 rows are marked True and 2 are marked False]


Fig. 5. Rule-based stemmer results


[Figure 6: listing that maps each wrongly stemmed word to its 'predictedStem' and 'realStem' values]


Fig. 6. Rule-based stemmer - wrongly predicted words


B. Experiment 2 - LSTM Based Sequence to Sequence Model


In this experiment, the LSTM-based sequence-to-sequence model was deployed using the Saraiki dataset to predict the stems of Saraiki words. The following parameters were set before deploying the LSTM-based sequence-to-sequence model.


# Batch size for training.
batch_size = 64
# Number of epochs to train for.
epochs = 100
# Latent dimensionality of the encoding space.
latent_dim = 256
# Number of samples to train on.
num_samples = 10000


After experiments, it was observed that the LSTM-based sequence-to-sequence model achieved 99% accuracy, while validation accuracy was 96%. Figure 7(a) shows the training and validation accuracy comparison, while Figure 7(b) shows the training and validation loss.


[Figure 7 plots: (a) model accuracy and (b) model loss, each plotted against epoch]


Fig. 7. (a) Comparison of training and validation accuracy (b) loss of the LSTM sequence to sequence model


Table IV lists some stem words generated by the LSTM sequence-to-sequence model stemmer.


Therefore, the rule-based stemmer produced 68.53% accuracy after experiments, while the LSTM-based sequence-to-sequence model produced 99% accuracy, as shown in Figure 8.


Table IV
Stem Words Generated by LSTM Sequence to Sequence Model Stemmer

Input sentence: 
Ors 

Decoded 
sentence: 4% 
Input sentence: 
UY je 
Decoded 
sentence: J ji 
Input sentence: 
Decoded 
sentence: ee 
Input sentence: 
Uns 5 

Decoded 
sentence: @s 
Input sentence: 
IA Sa 

Decoded 
sentence: 43 
Input sentence: 
dañ 

Decoded 
sentence: eS 
Input sentence: 
eX 
Decoded 
sentence: 4% 


Input sentence: 


"A32 
Decoded 
sentence: #38 


Input sentence: 


Glues 
Decoded 
sentence: 2445 


Input sentence: 


gi 


Decoded 
sentence: ¿xí 


Input sentence: 


24S 
Decoded 
sentence: JS 


Input sentence: 


sigh 
Decoded 
sentence: oe 


Input sentence: 


pies 
Decoded 

L 
sentence: o 


Input sentence: 


aa 


Decoded 
sentence: J« 


[Figure 8 chart: comparison of the LSTM based Sequence to Sequence Model and the Rule Based Stemmer]


Fig. 8. Hybrid Saraiki stemmer (rule-based and LSTM-based sequence to sequence model)


A. Conclusion 


Stemming is a strategy used in IR systems to improve retrieval accuracy by addressing the problem of document and vocabulary mismatch during indexing and searching. The model presented in this work provides a stemmer for the Saraiki language. The stemmer was evaluated using two approaches, namely rule-based and LSTM sequence-to-sequence. In our experiment, the rule-based approach gave 68% accuracy, while LSTM gave 93% accuracy. Hence, of the two, the LSTM sequence-to-sequence model is the more accurate stemming method for the Saraiki language. Future research can enhance the result by increasing the size of the